3 days ago i was looking through the gtdbtk manual and saw that
de_novo_wf was an option for analysis to create the trees,
from the description given:
knitr::include_url("https://ecogenomics.github.io/GTDBTk/commands/de_novo_wf.html")
i beleived this would be something i should do as it might produce
more accurate trees. sample 1Dt2d Enterobacter
cancerogenus had been placed by the classify_wf in the
previous gtdbtk analysis in the genus Pantoea, which lead me to
this search. after a bit of trial and error, i produced this
script
This ran as a slurm job on hawk (SCW) from rougly 20:10 on the 23rd to 01:00 on the 24th, totalling 4 hours and 50 minutes. The main parameters that i experimented with were
- #SBATCH --ntasks=5
- #SBATCH --time=24:00:00
- #SBATCH --mem=50g
- --cpus 10
I settled on these as being the “best”, however, it is entirely possible that they could be more optimised.
This analysis produced these files:
/scratch/scw2160/02_outputs/flye_asm/gtdb_tk_de_novo5/
.:
text.txt
ls
touch
list.txt
align
gtdbtk.bac120.decorated.tree
gtdbtk.bac120.decorated.tree-table
gtdbtk.log
identify
infer
gtdbtk.warnings.log
./align:
gtdbtk.bac120.msa.fasta.gz
gtdbtk.bac120.user_msa.fasta.gz
gtdbtk.bac120.filtered.tsv
./identify:
gtdbtk.ar53.markers_summary.tsv
gtdbtk.bac120.markers_summary.tsv
gtdbtk.translation_table_summary.tsv
gtdbtk.failed_genomes.tsv
./infer:
gtdbtk.bac120.decorated.tree
gtdbtk.bac120.decorated.tree-taxonomy
gtdbtk.bac120.decorated.tree-table
intermediate_results
./infer/intermediate_results:
gtdbtk.bac120.rooted.tree
gtdbtk.bac120.fasttree.log
gtdbtk.bac120.tree.log
gtdbtk.bac120.unrooted.tree
I then moved this gtdbtk.bac120.decorated.tree file into
Dendroscope for review, all 10 are on one tree, but 1Dt2d
is still being placed in the “wrong” genus. on review of its sister
accession on the ncbi database.
On the NCBI page for the sister accession, can be found a CheckM analysis that comes back with
completeness: 90%
contamination: 3.6%
Taxonomy check status: failed
Upon viewing the tree in Dendroscope, the joining node has the label
0.968. This I believe to be the probability the
relationship is correct. this implies they are the same species, and the
online sample is also identified as Enterobacter cancerogenus.
However, due to the checkm analysis i find it plausible that they both
have been misidentified and are in reality Pantoea species, i find this
the most parsimonious explanation. I will follow this up with a CheckM
analysis of my own on 1Dt2d
This was a “technical spike” or proof of concept for
de_novo_wf
📌 TODO: do another Checkm analysis on 1Dt2d to see if the values are similar to the online sample
i wanted to see if the outputs of checkm differed from checkm2 so ran that on hawk. I also began recreating the innital table for metadata about the bangor samples, in the spirit of automation, a less manual approach was chosen this time around.
using this
script i ran a slurm job on hawk under the lineage_wf
of CheckM for all 10 Bangor-made samples, this took just 4 minutes. I
also worked on exporting the data i want to tabulate off of hawk. The
past way i did this was by manually entering each file
and noting down the important characteristics. However, because there
are going to be more samples(and i wanted to be clever) i decided to use
a more automated process. This was done by identifying different
documents in the flye directories on hawk, specifically files called
“assembly_info.txt” which contain the same information, but are vastly
more exportable. These are stored here: cd
/scratch/scw2160/02_outputs/flye_asm/flye_asm_[accession]/ using this
script i exported them off of hawk.
the CheckM analysis produced this output directory. My export script exported all 10 “assembly_info.txt” files to a directory in my home directory, as well as adding their accession ID to the name, this is important as otherwise i wouldnt know which belonged to what accession. I then brought them down and stored them here.
with it being christmas i did not take a serious look at the significance of the outputs of either, so that is what i plan to do next so that i can have some conclusions, maybe by the end of tomorrow. In conclusion, this process is only half done and will continue into the following entry(s).
📌 TODO: Complete analysis / processing of checkm and table stuff
Basic stuff, finding out what i can do to the checkm analysis to turn it into publishable data, making the table using R scripts and the output files. exams in January are rapidly approaching, so i really do need to pick up the pace if i want to get this done before i begin revising. there is not a real goal for this exact piece, i just want to know how congruent the outputs from CheckM are to CheckM2.
Using the CheckM
documentation. I found the output that i needed was bin_stats_ext.tsv.
this is where the contamination(and a lot of other fun looking stats)
are held. I can thusly import this file to R to try and extract usefull
stuff / tabulate. Over the course of the day i developed and improved CheckMtablegenerator.R.
which creates the table layed out in results below. This uses a few
packages, one of which was new to me, gt. This is a very
useful package for table generation.
| Table 1 - subset of data obtained from CheckM analysis | ||||||||||
| of samples taken from the skin microbiome of Dendrobates tinctorius, including metrics of quality. | ||||||||||
| Accession | Marker lineage | Completeness | Contamination | GC | GC std | Genome size | contigs | Longest contig | Coding density | Predicted genes |
|---|---|---|---|---|---|---|---|---|---|---|
| 1Dt1h | o__Actinomycetales | 100.00 | 0.00 | 65.519% | 4.284% | 4,336,578 | 4 | 2,051,556 | 90.334% | 3,818 |
| 1Dt2d | f__Enterobacteriaceae | 100.00 | 0.90 | 53.889% | 1.270% | 5,345,294 | 6 | 2,288,975 | 87.931% | 4,937 |
| 1Dt100h | o__Actinomycetales | 98.84 | 0.19 | 52.446% | 0.564% | 2,379,473 | 2 | 2,349,045 | 89.586% | 2,071 |
| 2Dt1l | o__Actinomycetales | 98.82 | 2.32 | 67.742% | 0.844% | 3,996,506 | 16 | 1,089,026 | 93.707% | 3,915 |
| 3Dt1c | o__Sphingomonadales | 97.23 | 1.19 | 66.487% | 1.385% | 3,636,266 | 86 | 275,317 | 90.660% | 3,433 |
| 3Dt2j | o__Sphingomonadales | 96.19 | 1.21 | 65.644% | 1.752% | 2,813,977 | 75 | 195,248 | 92.779% | 2,786 |
| 3Dt2c | o__Sphingomonadales | 95.75 | 0.85 | 66.093% | 1.604% | 3,819,610 | 84 | 247,364 | 89.769% | 3,516 |
| 3Dt2h | o__Actinomycetales | 87.98 | 0.51 | 70.327% | 1.317% | 3,006,227 | 166 | 125,244 | 90.324% | 2,978 |
| 1Dt100g | o__Sphingomonadales | 76.77 | 0.00 | 68.105% | 1.332% | 2,104,071 | 153 | 74,994 | 92.645% | 2,179 |
| 2Dt1e | o__Actinomycetales | 44.74 | 0.00 | 71.162% | 1.565% | 2,332,403 | 149 | 70,689 | 89.402% | 2,215 |
| Source: CheckM lineage_wf performed on 2024-12-25 | ||||||||||
As mentioned, this table is a subset of the data, there was more outputted by CheckM, some of which overlaps with other analyses, but i did not know if it was relevent to include, so left it out, however, creating a new one with the omitted fields in would not be hard. This table will pair with one produced from CheckM2, because from what i’ve seen one is kind of an improvement on the other, but they kind of do different things, its confusing and thus why i want to do the comparison for future reference. I did not know what counted as “significantly complete”, so arbitrarily placed the value at 90%. This analysis shows that 7 of the 10 samples are complete or near enough, with only 3 falling below my line. It should be noted that 2Dt1e, the lowest sample is only found to be 45% complete, this is a large outlier from the rest of the range. Contamination is fairly low in all, none above 3% and 3 samples on 0.00%.
The main purpose of doing this is to compare with Checkm2, so im not sure what conclusions i can get from just this, other than many of the samples look sufficiently complete and i learned a fair amount from this experience. I think this will act as a good first step to compare future analysis against, for confirmation, as many upcoming analyses will give repeat data in some areas. Some of the field, for example “GC” are odd and i dont quite understand their significance, perhaps Aaron can give more info when he gets to this part of the notebook.
📌 TODO: finish the straight flye analysis / start on CheckM2
Today i wanted to finish off the work from yesterday and start on CheckM2. this should produce two tables that i can use to compare with the first one from Checkm, maybe find out which method is superior?
The order i did the two main focuses in shifted greatly during the day, with me bouncing between flye and CheckM2 analyses, so this section will have subheaders for better readability
I created an R script, “rawflyeanalysis.R”. Basic stuff, imports the text files i chose to download(i wonder if i could download the flye.log files and use a sub() to cut out the non-useful stuff). This then sorts and processes the data into the table seen below using a few packages, Table 2: Assembly_info.txt.
Just before lunch(11:45-ish) i set off the slurm job to run CheckM2 on hawk. Let’s see how that goes after lunch, one thing of note though, i did make a minor modification, in the original file, i was running it on the fastas one at a time(?) weird, anyway, i changed it to work on the directory containing all 10, so it should do them all at once. run_checkm2TN.sh.
After lunch, firstly i went back to the flye.log analysis i wanted to do largely for posterity. This was horrifyingly complex due to the “complex” nature of the flye.log files. I did this by creating a modified copy of my assembly_info.txt file grabber shell script, called flye.loggetter.sh, using ctrl + h to find and replace instead of manually changing things. I added this to an existing output directory in my home section of hawk, which i exported using MobaXterm(do i need to include a section on how i use MobaXterm?) to my computer. This is the folder called flyestuff, which contains both the flye.log and the assembly_info.txt files for each accession, i grouped them as it seemed relevent for organisation and didnt hinder the coding process at all. The R file i used to tabulate this is flyeloganalysis.R. This produced Table 3: Flye.log
This started with doing
cd /scratch/scw2160/02_outputs/flye_asm/CheckM2_20241228 to
see the outputs, the job took roughly half an hour, ending at 12:15 pm.
Firstly, i copied the output directory to my home area, then once again
used MobaXterm to bring it down to the git repo, here.
Then, i began work on the R file to turn the
quality_report.tsv file. This file was formatted very well,
so turning it into a table was very simple, done in CheckM2tablegenerator.R.
This produced Table 4: CheckM2
Note: there were other outputs to CheckM2, they were very interesting, im going to make the next section about exploring that possiblity, but they dont relate to this analysis.
| Table 2 - Data obtained from flye outputs | ||||
| of samples taken from the skin microbiome of Dendrobates tinctorius | ||||
| Accession | Number of contigs | Total length | Largest fragment | Mean fragment length |
|---|---|---|---|---|
| 3Dt2h | 166 | 3,006,227 | 125,244 | 18,109.80 |
| 1Dt100g | 153 | 2,104,071 | 74,994 | 13,752.10 |
| 2Dt1e | 149 | 2,332,403 | 70,689 | 15,653.71 |
| 3Dt1c | 86 | 3,636,266 | 275,317 | 42,282.16 |
| 3Dt2c | 84 | 3,819,610 | 247,364 | 45,471.55 |
| 3Dt2j | 75 | 2,813,977 | 195,248 | 37,519.69 |
| 2Dt1l | 16 | 3,996,506 | 1,089,026 | 249,781.62 |
| 1Dt2d | 6 | 5,345,294 | 2,288,975 | 890,882.33 |
| 1Dt1h | 4 | 4,336,578 | 2,051,556 | 1,084,144.50 |
| 1Dt100h | 2 | 2,379,473 | 2,349,045 | 1,189,736.50 |
| Source: assembly_info.txt files produced in flye_asm analysis, August 27th 2024 | ||||
| coverage could not be calculated from this method as no result matched what was found in the flye.log files | ||||
The basic flye output analysis produced this table, as mentioned, i couldn’t get coverage to match what was on the flye.log file that i got the original of this table from, it is possible that file is the one that is wrong, but genome coverage is kind of all over the place, so i just omitted it as i don’t think its one of the mandatory fields. I chose to sort the table by “Number of contigs” rather arbitrarily, but it does show something interesting in that why is there such a large range in fragment number?
| Table 3 - Data obtained from flye.log outputs | ||||||
| of samples taken from the skin microbiome of Dendrobates tinctorius | ||||||
| Accession | Fragments | Total length | Fragments N50 | Largest fragment | Scaffolds | Mean coverage |
|---|---|---|---|---|---|---|
| 3Dt2h | 166 | 3,006,227 | 26,385 | 125,244 | 0 | 390 |
| 1Dt100g | 153 | 2,104,071 | 18,022 | 74,994 | 0 | 318 |
| 2Dt1e | 149 | 2,332,403 | 22,011 | 70,689 | 0 | 344 |
| 3Dt1c | 86 | 3,636,266 | 63,365 | 275,317 | 0 | 320 |
| 3Dt2c | 84 | 3,819,610 | 69,733 | 247,364 | 0 | 190 |
| 3Dt2j | 75 | 2,813,977 | 71,970 | 195,248 | 0 | 350 |
| 2Dt1l | 16 | 3,996,506 | 345,773 | 1,089,026 | 0 | 289 |
| 1Dt2d | 6 | 5,345,294 | 1,019,507 | 2,288,975 | 0 | 225 |
| 1Dt1h | 4 | 4,336,578 | 1,857,409 | 2,051,556 | 0 | 404 |
| 1Dt100h | 2 | 2,379,473 | 2,349,045 | 2,349,045 | 0 | 224 |
| Source: flye.log files produced in flye_asm analysis, August 27th 2024 | ||||||
This is a table of the type of stuff i first collated in the Summer, it shares many similarities and cross overs with Table 2, neither have quality statistics, but both are metadata largely concerning the length and fragment, a marked difference is this one comes with a concrete value of coverage. Comparative analysis of them feels like a thing that would happen in the conclusion to this document.
| Table 4 - quality analysis from CheckM2 | |||||
| based on samples taken from the skin microbiome of Dendrobates tinctorius | |||||
| Accession | Completeness | Contamination | Completeness model used | Translation table used | Additional notes |
|---|---|---|---|---|---|
| 1Dt1h | 100.00 | 0.31 | Neural Network (Specific Model) | 11 | None |
| 1Dt2d | 100.00 | 0.87 | Neural Network (Specific Model) | 11 | None |
| 2Dt1l | 100.00 | 0.27 | Neural Network (Specific Model) | 11 | None |
| 3Dt2c | 99.94 | 0.39 | Neural Network (Specific Model) | 11 | None |
| 3Dt2j | 99.80 | 0.00 | Neural Network (Specific Model) | 11 | None |
| 1Dt100h | 99.12 | 0.00 | Gradient Boost (General Model) | 11 | None |
| 3Dt1c | 98.34 | 0.86 | Neural Network (Specific Model) | 11 | None |
| 3Dt2h | 86.04 | 0.10 | Neural Network (Specific Model) | 11 | None |
| 1Dt100g | 78.42 | 0.33 | Neural Network (Specific Model) | 11 | None |
| 2Dt1e | 48.29 | 0.08 | Neural Network (Specific Model) | 11 | None |
| Source: CheckM2 predict performed on 2024-12-28 | |||||
This is the table of CheckM2 analysis, there are many differences in which fields are produced, also, there appears to be differences in the values themselves for Completeness and Contamination, perhaps part of the comparitive analysis, done in the conclusion should be the date at which both were at, as older versions will probably have less accurate values, and this may cause the difference as genomics is a fast evolving field. A Noteable difference is that CheckM used an internal diamond database, whereas CheckM2 required me to download my own. Exact comparison will be conducted in the conclusion.
Thus far, i think that i got all of the fields in Table 2 in a more reliable format in Table 1, but with the quality checks as well, both tables suffer weirdness with “cov.” and “GC” are they the same? are they both coverage? it is as mysterious as it is annoying. Let’s see how they match up to CheckM2. It is getting late, i think that tomorrow is going to be the comparitive analysis and then moving onto the fun-looking checkm2 stuff.
📌 ?: CONCLUSION: how do comparison of all 4 tables next to each other without creating a monstrocity of indented code / something about pills/ find date difference in version history between CheckM and CheckM2 comparitive analysis/
Due to this section being largely about the conclusion from the last week, the other sections may be light, as this is largely just a wrap up i was too tired to do yesterday.
One thing that might be relevent to note is that i changed the Markdown file so that the tables generated above now come from .csv output files. This helps on cutting down the time taken to knit the document and means that there are not multiple instances of code across the document, as the results section down here will include all 4 tables in one form or another. (basically, this increases flexibility of how the data can be used). In order to find the version history for CheckM, i went here ______. For CheckM2, i went here ______ and looked at the comit history as the version on Hawk is a developer version, not one of the official release, beginning at ____.
| Table 5 - Comparison of quality metrics | ||||
| for the same fasta files between CheckM/1.1.3 and CheckM2/0.1.3 | ||||
| Accession |
CheckM/1.1.3
|
CheckM2/0.1.3
|
||
|---|---|---|---|---|
| Completeness | Contamination | Completeness | Contamination | |
| 1Dt100g | 76.77 | 0.00 | 78.42 | 0.33 |
| 1Dt100h | 98.84 | 0.19 | 99.12 | 0.00 |
| 1Dt1h | 100.00 | 0.00 | 100.00 | 0.31 |
| 1Dt2d | 100.00 | 0.90 | 100.00 | 0.87 |
| 2Dt1e | 44.74 | 0.00 | 48.29 | 0.08 |
| 2Dt1l | 98.82 | 2.32 | 100.00 | 0.27 |
| 3Dt1c | 97.23 | 1.19 | 98.34 | 0.86 |
| 3Dt2c | 95.75 | 0.85 | 99.94 | 0.39 |
| 3Dt2h | 87.98 | 0.51 | 86.04 | 0.10 |
| 3Dt2j | 96.19 | 1.21 | 99.80 | 0.00 |
| Source: CheckM lineage_wf performed on 2024-12-25 CheckM2 predict performed on 2024-12-28 |
||||
📌 ?: AFTER LUNCH: pill stuff for comparison of programs directly, date the programs, do the comparison for the “raw”/ no package analyses. / look in log files for diamond version for M and M2
| Table 2 - Data obtained from flye outputs | ||||
| of samples taken from the skin microbiome of Dendrobates tinctorius | ||||
| Accession | Number of contigs | Total length | Largest fragment | Mean fragment length |
|---|---|---|---|---|
| 3Dt2h | 166 | 3,006,227 | 125,244 | 18,109.80 |
| 1Dt100g | 153 | 2,104,071 | 74,994 | 13,752.10 |
| 2Dt1e | 149 | 2,332,403 | 70,689 | 15,653.71 |
| 3Dt1c | 86 | 3,636,266 | 275,317 | 42,282.16 |
| 3Dt2c | 84 | 3,819,610 | 247,364 | 45,471.55 |
| 3Dt2j | 75 | 2,813,977 | 195,248 | 37,519.69 |
| 2Dt1l | 16 | 3,996,506 | 1,089,026 | 249,781.62 |
| 1Dt2d | 6 | 5,345,294 | 2,288,975 | 890,882.33 |
| 1Dt1h | 4 | 4,336,578 | 2,051,556 | 1,084,144.50 |
| 1Dt100h | 2 | 2,379,473 | 2,349,045 | 1,189,736.50 |
| Source: assembly_info.txt files produced in flye_asm analysis, August 27th 2024 | ||||
| coverage could not be calculated from this method as no result matched what was found in the flye.log files | ||||
| Table 3 - Data obtained from flye.log outputs | ||||||
| of samples taken from the skin microbiome of Dendrobates tinctorius | ||||||
| Accession | Fragments | Total length | Fragments N50 | Largest fragment | Scaffolds | Mean coverage |
|---|---|---|---|---|---|---|
| 3Dt2h | 166 | 3,006,227 | 26,385 | 125,244 | 0 | 390 |
| 1Dt100g | 153 | 2,104,071 | 18,022 | 74,994 | 0 | 318 |
| 2Dt1e | 149 | 2,332,403 | 22,011 | 70,689 | 0 | 344 |
| 3Dt1c | 86 | 3,636,266 | 63,365 | 275,317 | 0 | 320 |
| 3Dt2c | 84 | 3,819,610 | 69,733 | 247,364 | 0 | 190 |
| 3Dt2j | 75 | 2,813,977 | 71,970 | 195,248 | 0 | 350 |
| 2Dt1l | 16 | 3,996,506 | 345,773 | 1,089,026 | 0 | 289 |
| 1Dt2d | 6 | 5,345,294 | 1,019,507 | 2,288,975 | 0 | 225 |
| 1Dt1h | 4 | 4,336,578 | 1,857,409 | 2,051,556 | 0 | 404 |
| 1Dt100h | 2 | 2,379,473 | 2,349,045 | 2,349,045 | 0 | 224 |
| Source: flye.log files produced in flye_asm analysis, August 27th 2024 | ||||||
| Table 1 - subset of data obtained from CheckM analysis | ||||||||||
| of samples taken from the skin microbiome of Dendrobates tinctorius, including metrics of quality. | ||||||||||
| Accession | Marker lineage | Completeness | Contamination | GC | GC std | Genome size | contigs | Longest contig | Coding density | Predicted genes |
|---|---|---|---|---|---|---|---|---|---|---|
| 1Dt1h | o__Actinomycetales | 100.00 | 0.00 | 65.519% | 4.284% | 4,336,578 | 4 | 2,051,556 | 90.334% | 3,818 |
| 1Dt2d | f__Enterobacteriaceae | 100.00 | 0.90 | 53.889% | 1.270% | 5,345,294 | 6 | 2,288,975 | 87.931% | 4,937 |
| 1Dt100h | o__Actinomycetales | 98.84 | 0.19 | 52.446% | 0.564% | 2,379,473 | 2 | 2,349,045 | 89.586% | 2,071 |
| 2Dt1l | o__Actinomycetales | 98.82 | 2.32 | 67.742% | 0.844% | 3,996,506 | 16 | 1,089,026 | 93.707% | 3,915 |
| 3Dt1c | o__Sphingomonadales | 97.23 | 1.19 | 66.487% | 1.385% | 3,636,266 | 86 | 275,317 | 90.660% | 3,433 |
| 3Dt2j | o__Sphingomonadales | 96.19 | 1.21 | 65.644% | 1.752% | 2,813,977 | 75 | 195,248 | 92.779% | 2,786 |
| 3Dt2c | o__Sphingomonadales | 95.75 | 0.85 | 66.093% | 1.604% | 3,819,610 | 84 | 247,364 | 89.769% | 3,516 |
| 3Dt2h | o__Actinomycetales | 87.98 | 0.51 | 70.327% | 1.317% | 3,006,227 | 166 | 125,244 | 90.324% | 2,978 |
| 1Dt100g | o__Sphingomonadales | 76.77 | 0.00 | 68.105% | 1.332% | 2,104,071 | 153 | 74,994 | 92.645% | 2,179 |
| 2Dt1e | o__Actinomycetales | 44.74 | 0.00 | 71.162% | 1.565% | 2,332,403 | 149 | 70,689 | 89.402% | 2,215 |
| Source: CheckM lineage_wf performed on 2024-12-25 | ||||||||||
| Table 4 - quality analysis from CheckM2 | |||||
| based on samples taken from the skin microbiome of Dendrobates tinctorius | |||||
| Accession | Completeness | Contamination | Completeness model used | Translation table used | Additional notes |
|---|---|---|---|---|---|
| 1Dt1h | 100.00 | 0.31 | Neural Network (Specific Model) | 11 | None |
| 1Dt2d | 100.00 | 0.87 | Neural Network (Specific Model) | 11 | None |
| 2Dt1l | 100.00 | 0.27 | Neural Network (Specific Model) | 11 | None |
| 3Dt2c | 99.94 | 0.39 | Neural Network (Specific Model) | 11 | None |
| 3Dt2j | 99.80 | 0.00 | Neural Network (Specific Model) | 11 | None |
| 1Dt100h | 99.12 | 0.00 | Gradient Boost (General Model) | 11 | None |
| 3Dt1c | 98.34 | 0.86 | Neural Network (Specific Model) | 11 | None |
| 3Dt2h | 86.04 | 0.10 | Neural Network (Specific Model) | 11 | None |
| 1Dt100g | 78.42 | 0.33 | Neural Network (Specific Model) | 11 | None |
| 2Dt1e | 48.29 | 0.08 | Neural Network (Specific Model) | 11 | None |
| Source: CheckM2 predict performed on 2024-12-28 | |||||
For the CheckM vs CheckM2 analysis, it should be noted that ____ is older, being released at ____, whereas the other was released at ____. However, both versions found on hawk are out of date from the modern ________. It would appear that CheckM2 is not a replacement for CheckM, as both are being updated as of 2023. CheckM uses its own version of the database, so i so far dont have a date for it_____. CheckM2 requires me to download my own diamond database beforehand, this was done around the 13th of August 2024, however, the version of the database is from the 23rd of March 2021.
📌 TODO: (after conc of last bit) investigations into KO and protein stuff / just found out checkm is better documented and has some graph output options, so that is something to look into as well